This document is a data science report on the Kaggle House Prices tutorial project. It was generated with the Shapash library.
Version : 0.7
Name : House Prices Prediction Project
Purpose : Predicting the sale price of houses
Date : 2021-04-07
Contributors : Yann Golhen, Sebastien Bidault, Thomas Bouche, Guillaume Vignal, Thibaud Real
Description : This data science project tries to predict the sale price of houses from 79 explanatory variables. It was designed by the data science team at X. and has been improved since the beginning of the project in 2019. The model has been in production since February 2021.
Source Code : https://github.com/MAIF/shapash/tree/master/tutorial
Origin : The Assessor’s Office
Description : the sale of individual residential property in Ames, Iowa
Depth : from 2006 to 2010
Perimeter : only residential sales
Target Variable : SalePrice
Target Description : The property's sale price in dollars
Variable Filtering : All variables that required special knowledge or previous calculations for their use were removed
Individual Filtering : only the most recent sales data on any property were kept (for houses that were sold multiple times during this period)
Missing Values : were replaced by 0
Created Variables : No features were created. All features are taken directly from the Kaggle dataset
Transformed Variables : Categorical features were transformed using an ordinal encoder
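The preparation steps above can be sketched roughly as follows. This is a minimal illustration on toy data, not the project's actual code: the `property_id` column is hypothetical (the report does not say how repeated sales were identified), and scikit-learn's `OrdinalEncoder` is an assumption, since the report does not name the encoder implementation.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data. "property_id" is a hypothetical identifier; other names mimic Kaggle columns.
raw = pd.DataFrame({
    "property_id": [1, 1, 2, 3],
    "YrSold": [2006, 2009, 2008, 2010],
    "LotArea": [8450.0, 8450.0, None, 11250.0],
    "CentralAir": ["Y", "Y", "N", "Y"],
    "SalePrice": [180000, 208500, 140000, 223500],
})

# Individual filtering: keep only the most recent sale of each property.
latest = (
    raw.sort_values("YrSold")
    .drop_duplicates(subset="property_id", keep="last")
    .reset_index(drop=True)
)

# Missing values: replaced by 0 (only numeric columns have gaps in this toy data).
numeric_cols = latest.select_dtypes(include="number").columns
latest[numeric_cols] = latest[numeric_cols].fillna(0)

# Transformed variables: ordinal-encode the categorical columns.
categorical_cols = ["CentralAir"]
encoder = OrdinalEncoder()
encoded = encoder.fit_transform(latest[categorical_cols])
for i, col in enumerate(categorical_cols):
    latest[col] = encoded[:, i]

print(latest)
```

An ordinal encoder assigns integer codes in an arbitrary (here alphabetical) order, which tree-based models such as random forests generally tolerate because they split on thresholds rather than relying on distances.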
Model used : RandomForestRegressor
Library : sklearn.ensemble._forest
Library version : 0.23.2
Model parameters :
| Parameter key | Parameter value |
|---|---|
| base_estimator | DecisionTreeRegressor() |
| n_estimators | 50 |
| estimator_params | ('criterion', 'max_depth', 'min_samples_split', 'min_samples_leaf', 'min_weight_fraction_leaf', 'max_features', 'max_leaf_nodes', 'min_impurity_decrease', 'min_impurity_split', 'random_state', 'ccp_alpha') |
| bootstrap | True |
| oob_score | False |
| n_jobs | None |
| random_state | None |
| verbose | 0 |
| warm_start | False |
| class_weight | None |
| max_samples | None |
| criterion | mse |
| max_depth | None |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | auto |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |
| min_impurity_split | None |
| ccp_alpha | 0.0 |
| n_features_in_ | 72 |
| n_features_ | 72 |
| n_outputs_ | 1 |
| base_estimator_ | DecisionTreeRegressor() |
| estimators_ | [DecisionTreeRegressor(max_features='auto', random_state=1844014871), DecisionTreeRegressor(max_features='auto', random_state=79755033), DecisionTreeRegressor(max_features='auto', random_state=1237390286), DecisionTreeRegressor(max_features='auto', random_state=675509169),... |
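As a rough sketch, a model with the parameters listed above could be set up as shown below. Only `n_estimators=50` is passed explicitly: the other listed values correspond to scikit-learn 0.23.2 defaults, and some of them (for example `criterion='mse'` and `max_features='auto'`) are no longer accepted by recent releases. The synthetic data and the 75/25 `train_test_split` are assumptions chosen to give 1095 training rows, 365 prediction rows and 72 features; the report does not state how the split was made.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared dataset: 1460 rows, 72 features (assumed shapes).
rng = np.random.default_rng(0)
X = rng.normal(size=(1460, 72))
y = 180_000 + 50_000 * X[:, 0] + rng.normal(scale=10_000, size=1460)

# Assumed 75/25 split, which yields 1095 training rows and 365 prediction rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=0
)

# Random forest with 50 trees; every other hyperparameter is left at its default.
model = RandomForestRegressor(n_estimators=50)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(X_train.shape, X_test.shape, len(model.estimators_))
```

Because `random_state` is `None` in the report, each tree receives its own random seed at fit time, so refitting will not reproduce the exact `estimators_` listed above.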
| | Training dataset | Prediction dataset |
|---|---|---|
| number of features | 72 | 72 |
| number of observations | 1095 | 365 |
| missing values | 0 | 0 |
| % missing values | 0 | 0 |
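A summary of this kind could be computed with pandas as sketched below. The `summarize` helper and the toy frame are illustrative, and the percentage is assumed to be the share of missing cells among all cells, which the report does not specify.

```python
import pandas as pd

def summarize(df):
    """Return the global counts shown in the summary table for one dataset."""
    n_missing = int(df.isna().sum().sum())
    return {
        "number of features": df.shape[1],
        "number of observations": df.shape[0],
        "missing values": n_missing,
        # Assumed definition: missing cells divided by total cells.
        "% missing values": 100 * n_missing / df.size if df.size else 0.0,
    }

# Toy frame with one missing cell out of eight.
train = pd.DataFrame({
    "LotArea": [8450, 9600, None, 9550],
    "OverallQual": [7, 6, 7, 7],
})
summary = summarize(train)
print(summary)
```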
First Floor square feet
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1095 | 365 |
| mean | 1180 | 1120 |
| std | 400 | 341 |
| min | 334 | 483 |
| 25% | 886 | 864 |
| 50% | 1100 | 1050 |
| 75% | 1420 | 1320 |
| max | 4690 | 2630 |
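Per-feature tables like the one above have the same rows as the output of pandas `describe()`. The sketch below builds a side-by-side table for one feature. The values are toy data, `1stFlrSF` is the Kaggle column name for first floor square feet, and the report's figures appear to be rounded to three significant figures, which this sketch does not reproduce.

```python
import pandas as pd

# Toy values standing in for the training and prediction datasets.
train = pd.DataFrame({"1stFlrSF": [856, 1262, 920, 961, 1145]})
prediction = pd.DataFrame({"1stFlrSF": [1694, 1107, 774]})

# One describe() column per dataset, aligned on the statistic name.
stats = pd.concat(
    {
        "Training dataset": train["1stFlrSF"].describe(),
        "Prediction dataset": prediction["1stFlrSF"].describe(),
    },
    axis=1,
)
print(stats)
```

Comparing the two columns row by row is a quick way to spot distribution shifts between the data used for training and the data being scored.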
Second floor square feet
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1095 | 365 |
| mean | 342 | 362 |
| std | 435 | 442 |
| min | 0 | 0 |
| 25% | 0 | 0 |
| 50% | 0 | 0 |
| 75% | 728 | 728 |
| max | 1870 | 2060 |
Three season porch area in square feet
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1095.00 | 365.00 |
| mean | 3.72 | 2.48 |
| std | 31.60 | 21.10 |
| min | 0.00 | 0.00 |
| 25% | 0.00 | 0.00 |
| 50% | 0.00 | 0.00 |
| 75% | 0.00 | 0.00 |
| max | 508.00 | 245.00 |
Bedrooms above grade
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 8 | 7 |
| missing values | 0 | 0 |
Type of dwelling
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 5 | 5 |
| missing values | 0 | 0 |
General condition of the basement
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 4 | 4 |
| missing values | 0 | 0 |
Refers to walkout or garden level walls
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 4 | 4 |
| missing values | 0 | 0 |
Type 1 finished square feet
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1095 | 365 |
| mean | 453 | 416 |
| std | 465 | 429 |
| min | 0 | 0 |
| 25% | 0 | 0 |
| 50% | 399 | 350 |
| 75% | 722 | 678 |
| max | 5640 | 2100 |
Type 2 finished square feet
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1095.0 | 365 |
| mean | 44.7 | 52 |
| std | 162.0 | 159 |
| min | 0.0 | 0 |
| 25% | 0.0 | 0 |
| 50% | 0.0 | 0 |
| 75% | 0.0 | 0 |
| max | 1470.0 | 1060 |
Rating of basement finished area
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 6 | 6 |
| missing values | 0 | 0 |
Rating of basement finished area (if present)
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 6 | 6 |
| missing values | 0 | 0 |
Basement full bathrooms
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 4 | 3 |
| missing values | 0 | 0 |
Basement half bathrooms
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 3 | 3 |
| missing values | 0 | 0 |
Height of the basement
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 4 | 4 |
| missing values | 0 | 0 |
Unfinished square feet of basement area
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1095 | 365 |
| mean | 570 | 558 |
| std | 446 | 429 |
| min | 0 | 0 |
| 25% | 224 | 217 |
| 50% | 483 | 464 |
| 75% | 812 | 796 |
| max | 2340 | 2040 |
Central air conditioning
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 2 | 2 |
| missing values | 0 | 0 |
Proximity to various conditions
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 9 | 8 |
| missing values | 0 | 0 |
Proximity to other various conditions
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 7 | 3 |
| missing values | 0 | 0 |
Electrical system
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 5 | 4 |
| missing values | 0 | 0 |
Enclosed porch area in square feet
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1095.0 | 365.0 |
| mean | 23.0 | 18.8 |
| std | 63.2 | 54.5 |
| min | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 |
| 75% | 0.0 | 0.0 |
| max | 552.0 | 272.0 |
Exterior materials' condition
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 5 | 3 |
| missing values | 0 | 0 |
Exterior materials' quality
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 4 | 4 |
| missing values | 0 | 0 |
Exterior covering on house
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 14 | 12 |
| missing values | 0 | 0 |
Other exterior covering on house
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 16 | 14 |
| missing values | 0 | 0 |
Number of fireplaces
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 4 | 4 |
| missing values | 0 | 0 |
Type of foundation
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 6 | 5 |
| missing values | 0 | 0 |
Full bathrooms above grade
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 4 | 4 |
| missing values | 0 | 0 |
Home functionality
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 7 | 5 |
| missing values | 0 | 0 |
Size of garage in square feet
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1095 | 365 |
| mean | 475 | 466 |
| std | 211 | 222 |
| min | 0 | 0 |
| 25% | 329 | 336 |
| 50% | 480 | 466 |
| 75% | 576 | 576 |
| max | 1420 | 1390 |
Garage condition
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 5 | 4 |
| missing values | 0 | 0 |
Interior finish of the garage
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 3 | 3 |
| missing values | 0 | 0 |
Garage quality
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 5 | 4 |
| missing values | 0 | 0 |
Garage location
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 6 | 6 |
| missing values | 0 | 0 |
Year garage was built
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1095.0 | 365.0 |
| mean | 1980.0 | 1980.0 |
| std | 26.6 | 25.5 |
| min | 1870.0 | 1900.0 |
| 25% | 1960.0 | 1960.0 |
| 50% | 1980.0 | 1980.0 |
| 75% | 2000.0 | 2000.0 |
| max | 2010.0 | 2010.0 |
Ground living area square feet
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1095 | 365 |
| mean | 1520 | 1490 |
| std | 530 | 512 |
| min | 334 | 605 |
| 25% | 1130 | 1130 |
| 50% | 1470 | 1460 |
| 75% | 1790 | 1730 |
| max | 5640 | 4480 |
Half baths above grade
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 3 | 3 |
| missing values | 0 | 0 |
Type of heating
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 6 | 3 |
| missing values | 0 | 0 |
Heating quality and condition
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 4 | 5 |
| missing values | 0 | 0 |
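For this feature the prediction dataset has more distinct values (5) than the training dataset (4), so at least one category was never seen during training. A check along the following lines can flag such cases before encoding. The helper name and the toy category codes are illustrative only.

```python
def unseen_categories(train_values, prediction_values):
    """Return the categories present in the prediction data but absent from training."""
    return sorted(set(prediction_values) - set(train_values))

# Toy category codes: four levels in training, five in the prediction data.
train_heating_qc = ["Ex", "Gd", "TA", "Fa", "Ex", "TA"]
prediction_heating_qc = ["Ex", "Gd", "TA", "Fa", "Po"]

missing_from_train = unseen_categories(train_heating_qc, prediction_heating_qc)
print(missing_from_train)  # ['Po']
```

How an unseen category is encoded depends on the encoder and its settings, so it is worth deciding explicitly whether such rows should raise an error or be mapped to a dedicated fallback value.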
Style of dwelling
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 8 | 8 |
| missing values | 0 | 0 |
Kitchens above grade
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 3 | 3 |
| missing values | 0 | 0 |
Kitchen quality
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 4 | 4 |
| missing values | 0 | 0 |
Flatness of the property
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 4 | 4 |
| missing values | 0 | 0 |
Slope of property
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 3 | 3 |
| missing values | 0 | 0 |
Lot size square feet
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1095 | 365 |
| mean | 10600 | 10300 |
| std | 10200 | 9410 |
| min | 1300 | 1600 |
| 25% | 7500 | 7740 |
| 50% | 9500 | 9300 |
| 75% | 11600 | 11500 |
| max | 215000 | 165000 |
Lot configuration
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 5 | 5 |
| missing values | 0 | 0 |
General shape of property
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 4 | 4 |
| missing values | 0 | 0 |
Low quality finished square feet
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1095.00 | 365.00 |
| mean | 5.18 | 7.83 |
| std | 46.40 | 54.70 |
| min | 0.00 | 0.00 |
| 25% | 0.00 | 0.00 |
| 50% | 0.00 | 0.00 |
| 75% | 0.00 | 0.00 |
| max | 572.00 | 513.00 |
Building Class
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 15 | 15 |
| missing values | 0 | 0 |
General zoning classification
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 5 | 5 |
| missing values | 0 | 0 |
Masonry veneer area in square feet
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1095 | 365 |
| mean | 101 | 109 |
| std | 173 | 203 |
| min | 0 | 0 |
| 25% | 0 | 0 |
| 50% | 0 | 0 |
| 75% | 168 | 145 |
| max | 1600 | 1380 |
Masonry veneer type
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 4 | 4 |
| missing values | 0 | 0 |
Value of miscellaneous feature (in dollars)
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1095.0 | 365.0 |
| mean | 51.5 | 19.3 |
| std | 569.0 | 111.0 |
| min | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 |
| 75% | 0.0 | 0.0 |
| max | 15500.0 | 1200.0 |
Month Sold
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 12 | 12 |
| missing values | 0 | 0 |
Physical locations within Ames city limits
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 25 | 25 |
| missing values | 0 | 0 |
Open porch area in square feet
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1095.0 | 365.0 |
| mean | 46.9 | 46.1 |
| std | 67.6 | 62.1 |
| min | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 |
| 50% | 26.0 | 24.0 |
| 75% | 66.0 | 72.0 |
| max | 547.0 | 341.0 |
Overall condition of the house
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 8 | 9 |
| missing values | 0 | 0 |
Overall material and finish of the house
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 10 | 10 |
| missing values | 0 | 0 |
Paved driveway
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 3 | 3 |
| missing values | 0 | 0 |
Pool area in square feet
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 6 | 3 |
| missing values | 0 | 0 |
Roof material
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 7 | 5 |
| missing values | 0 | 0 |
Type of roof
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 6 | 6 |
| missing values | 0 | 0 |
Condition of sale
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 6 | 6 |
| missing values | 0 | 0 |
Type of sale
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 9 | 8 |
| missing values | 0 | 0 |
Screen porch area in square feet
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1095.0 | 365.0 |
| mean | 16.4 | 11.0 |
| std | 58.0 | 48.3 |
| min | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 |
| 75% | 0.0 | 0.0 |
| max | 480.0 | 396.0 |
Type of road access
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 2 | 2 |
| missing values | 0 | 0 |
Total rooms above grade
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 12 | 10 |
| missing values | 0 | 0 |
Total square feet of basement area
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1095 | 365 |
| mean | 1070 | 1030 |
| std | 453 | 392 |
| min | 0 | 0 |
| 25% | 799 | 780 |
| 50% | 996 | 972 |
| 75% | 1320 | 1240 |
| max | 6110 | 2630 |
Type of utilities available
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 2 | 1 |
| missing values | 0 | 0 |
Wood deck area in square feet
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1095 | 365 |
| mean | 96 | 89 |
| std | 124 | 131 |
| min | 0 | 0 |
| 25% | 0 | 0 |
| 50% | 0 | 0 |
| 75% | 168 | 164 |
| max | 736 | 857 |
Original construction date
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1095.0 | 365.0 |
| mean | 1970.0 | 1970.0 |
| std | 30.3 | 29.8 |
| min | 1870.0 | 1880.0 |
| 25% | 1950.0 | 1950.0 |
| 50% | 1970.0 | 1970.0 |
| 75% | 2000.0 | 2000.0 |
| max | 2010.0 | 2010.0 |
Remodel date
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1095.0 | 365.0 |
| mean | 1990.0 | 1980.0 |
| std | 20.6 | 20.7 |
| min | 1950.0 | 1950.0 |
| 25% | 1970.0 | 1960.0 |
| 50% | 1990.0 | 1990.0 |
| 75% | 2000.0 | 2000.0 |
| max | 2010.0 | 2010.0 |
Year Sold
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 5 | 5 |
| missing values | 0 | 0 |
Target variable (SalePrice)
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1095 | 365 |
| mean | 182000 | 177000 |
| std | 78500 | 82000 |
| min | 34900 | 40000 |
| 25% | 130000 | 126000 |
| 50% | 165000 | 160000 |
| 75% | 215000 | 205000 |
| max | 755000 | 745000 |
Note : the explainability graphs were generated using the test set only.
Explainability graphs (not reproduced here): one graph per feature, for the 72 features in the same order as the per-feature tables above, from First Floor square feet to Year Sold.
| | True values | Prediction values |
|---|---|---|
| count | 365 | 365 |
| mean | 177000 | 176000 |
| std | 82000 | 68200 |
| min | 40000 | 77400 |
| 25% | 126000 | 129000 |
| 50% | 160000 | 160000 |
| 75% | 205000 | 199000 |
| max | 745000 | 480000 |
Mean absolute error : 16773.08
Mean squared error : 775516588.1
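For reference, the two metrics above can be computed as in the following sketch, which uses only the Python standard library and three toy prices rather than the project's data. Since the target is in dollars, the mean absolute error is in dollars and the mean squared error in squared dollars; taking the square root of the reported mean squared error gives a root mean squared error of roughly 27848 dollars, which is easier to compare with the mean absolute error.

```python
import math

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy sale prices in dollars (illustrative values only).
y_true = [200000, 150000, 310000]
y_pred = [190000, 165000, 300000]

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)

# Root mean squared error derived from the MSE reported above.
rmse_reported = math.sqrt(775516588.1)

print(round(mae, 2), round(mse, 2), round(rmse_reported, 1))
```

A root mean squared error well above the mean absolute error, as here, usually indicates that a few large errors weigh heavily on the squared metric.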